Dropout Training for SVMs with
Data Augmentation 

Ning Chen and Jun Zhu, Member, IEEE, Jianfei Chen and Ting Chen 


Abstract —Dropout and other feature noising schemes have shown promising results in controlling over-fitting by artificially 
corrupting the training data. Though extensive theoretical and empirical studies have been performed for generalized linear 
models, little work has been done for support vector machines (SVMs), one of the most successful approaches for supervised 
learning. This paper presents dropout training for both linear SVMs and the nonlinear extension with latent representation 
learning. For linear SVMs, to deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, 
we develop an iteratively re-weighted least square (IRLS) algorithm by exploring data augmentation techniques. Our algorithm 
iteratively minimizes the expectation of a re-weighted least square problem, where the re-weights are analytically updated. For 
nonlinear latent SVMs, we consider learning one layer of latent representations in SVMs and extend the data augmentation 
technique in conjunction with first-order Taylor-expansion to deal with the intractable expected non-smooth hinge loss and the 
nonlinearity of latent representations. Finally, we apply similar data augmentation ideas to develop a new IRLS algorithm
for the expected logistic loss under corrupting distributions, and we further develop a nonlinear extension of logistic regression
by incorporating one layer of latent representations. Our algorithms offer insights into the connections and differences between
the hinge loss and logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of
dropout training in significantly boosting the classification accuracy of both linear and nonlinear SVMs. In addition, the nonlinear
SVMs further improve the prediction performance on several image datasets.

Index Terms —Dropout, SVMs, logistic regression, data augmentation, iteratively reweighted least square 



1 Introduction 


Artificial feature noising augments the finite training data with a large (or even infinite) number of corrupted versions, by corrupting the given training examples with a fixed noise distribution. Among the many noising schemes, dropout training is an effective way to control over-fitting of large deep networks by randomly omitting subsets of neurons (or features) at each iteration of a training procedure. By formulating the feature noising methods as minimizing the expectation of some loss functions under the corrupting distributions, recent work has provided theoretical understanding of such schemes from the perspective of adaptive regularization [46], and has shown promising empirical results in various applications, including document classification [46], named entity recognition [48], image classification, tag recommendation, etc.

Regarding the loss functions, though much work has been done on the quadratic loss, logistic loss, or the log-loss induced from a generalized linear model (GLM) [41], [46], [48], little work has been done on the margin-based hinge loss underlying the very successful support vector machines (SVMs). One technical challenge is that the non-smoothness of the hinge loss makes it hard to compute or even approximate its expectation under a given corrupting distribution. Existing methods are not directly applicable, which calls for new solutions. This paper addresses this challenge and fills the gap by extending dropout training, as well as other feature noising schemes, to support vector machines.

Previous efforts on learning SVMs with feature noising have been devoted to either explicit corruption or an adversarial worst-case analysis. For example, virtual support vector machines explicitly augment the training data, which are usually support vectors from previous learning iterations to save computational cost, with a finite number of additional examples that are corrupted through some invariant transformation models. A standard SVM is then learned on the corrupted data. Though simple and effective, such an approach lacks elegance, and the computational cost of processing the extra corrupted examples could be prohibitive for many applications. Other work adopts an adversarial worst-case analysis to improve the robustness of SVMs against feature deletion in testing data. Though rigorous in theory, a worst-case scenario is unlikely to be encountered in practice. Moreover, the worst-case analysis usually results in solving a complex and computationally demanding problem.

In this paper, we perform an average-case analysis and show that it is efficient to train linear SVM and nonlinear SVM predictors with latent representation learning on an infinite number of corrupted copies of the training data by marginalizing out the corruption distributions. We concentrate on dropout training, but the results are directly extensible to other noising models, such as Gaussian, Poisson and Laplace. For all these noising schemes, the resulting expected hinge loss can be upper-bounded by a variational objective by introducing auxiliary variables, which follow a generalized inverse Gaussian distribution [30]. We apply similar ideas to the expected logistic loss by introducing Polya-Gamma distributed auxiliary variables. Specifically, we make the following contributions:

(1) We develop an iteratively re-weighted least square 
(IRLS) algorithm for dropout training of linear SVMs 
for both classification and regression. By minimizing a 
variational objective based on data augmentation, our 
algorithm minimizes the expectation of a re-weighted 
quadratic loss under the given corrupting distribution 
at each iteration, where the re-weights are computed 
in a simple closed form; 

(2) We generalize the data augmentation ideas to develop
an IRLS algorithm for dropout training of nonlinear
SVMs that consist of one hidden layer for representation
learning. In order to deal with the non-smoothness
of the expected hinge loss and the nonlinearity of the
latent feature extractors, we apply a Taylor expansion
to derive an approximate objective, and then optimize
it with an iterative algorithm;

(3) We further generalize the above ideas to develop IRLS
algorithms for dropout training of logistic regression
with and without one layer of nonlinear hidden units.
By sharing similar structures with those for SVMs, our
IRLS algorithms shed light on the connections and
differences between the hinge loss and logistic loss in
the context of dropout training, complementing
previous analyses in the supervised learning
settings;

(4) We present empirical results on several image and text
classification tasks and a challenging “nightmare at
test time” scenario. Our results demonstrate the
effectiveness of our approaches in comparison with
various strong competitors.

The rest of the paper is structured as follows. Section 2 reviews the related work. Section 3 introduces the framework of learning with marginalized corrupted features. Section 4 presents both linear and nonlinear dropout SVMs for classification and regression, with an iteratively re-weighted least square (IRLS) algorithm. Section 5 presents both linear and nonlinear dropout logistic regression with new IRLS algorithms. Section 6 presents empirical results. Section 7 concludes with a discussion of future directions.

2 Related Work 

Dropout training has been recognized as an effective feature noising strategy for neural networks by randomly dropping hidden units during training. One representative dropout strategy is the standard “Monte Carlo” dropout, or explicit corruption [36], which has been applied in neural networks to prevent the feature co-adaptation effect and improve prediction performance in many applications, e.g., image classification [22], [23], document classification, named entity recognition [48], tag recommendation, online prediction with expert advice [42], spoken language understanding [28], etc. Dropout training also performs well on standard machine learning models, e.g., DART, an ensemble model of boosted regression trees using dropout training [32].

In contrast to the standard “Monte Carlo” dropout, in this paper we focus on the class of models that are considered to be deterministic versions of dropout, obtained by marginalizing the noise. These models are formalized as marginalized corrupted features (MCF), and do not need random selection. It is possible to compute gradients of the marginalized loss functions. Representative work on MCF includes marginalized denoising autoencoders for domain adaptation and for learning nonlinear representations, and marginalized dropout noise in linear regression [36]. Besides, other work explores the idea of marginalized dropout for speed-up, and develops several loss functions in the empirical risk minimization framework under different input noise distributions. Moreover, the MCF framework has also been developed for link prediction, multi-label prediction, image tagging, and distance metric learning [21].

Both theoretical and empirical analyses have shown that dropout training under MCF is equivalent to adding a regularization effect to the model for controlling over-fitting. [46] describes how dropout can be seen as an adaptive regularizer, and other work proposes a theoretical explanation for why dropout training has been successful on high-dimensional single-layer natural language tasks: dropout preserves the Bayes decision boundary and should therefore induce minimal bias in high dimensions. Another line of work develops a pseudo-ensemble by applying dropout to perturb the parent model, and examines the relationship to standard ensemble methods by presenting a novel regularizer based on the noising process. Other work analyzes some underlying problems, e.g., when the dropout-regularized criterion has a unique minimizer and when the dropout-regularization penalty goes to infinity with the weights. Dropout has also been studied from a Bayesian standpoint, which enables the dropout rates to be optimized for better performance.

Though much work has been done on marginalizing the quadratic loss, logistic loss, or the log-loss induced from a generalized linear model (GLM) [41], [46], [48], little work has been done on the margin-based hinge loss underlying the very successful support vector machines (SVMs), as discussed in Section 1. The technical challenge is that the non-smoothness of the hinge loss makes it hard to compute or even approximate its expectation under a given corrupting distribution. Existing methods are not directly applicable. This paper addresses this challenge and fills the gap by extending dropout training, as well as other feature noising schemes, to SVMs. Finally, some preliminary results were reported previously, and this paper presents a systematic extension.

3 Preliminaries

We set up the problem and review learning with
marginalized corrupted features.

3.1 Regularized loss minimization 

Consider binary classification, where each training example is a pair $(\mathbf{x}, y)$ with $\mathbf{x} \in \mathbb{R}^D$ being an input feature vector and $y \in \{+1, -1\}$ being a binary label. Given a set of training data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$, supervised learning aims to find a function $f \in \mathcal{F}$ that maps each input to a label. To find the optimal candidate, it commonly solves a regularized loss minimization problem

$$\min_{f \in \mathcal{F}} \; \Omega(f) + 2c \cdot \mathcal{R}(\mathcal{D}; f), \tag{1}$$

where $\mathcal{R}(\mathcal{D}; f)$ is the empirical risk of applying $f$ to the training data; $\Omega(f)$ is a regularization term to control over-fitting; and $c$ is a non-negative regularization parameter. Note that we include the factor “2” simply for notational clarity, as will become clear soon.

For linear models, the function $f$ is simply parameterized as $f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^\top \mathbf{x} + b$, where $\mathbf{w}$ is the weight vector and $b$ is an offset. We will denote $\theta = \{\mathbf{w}, b\}$ for clarity. Then, the regularization can be any Euclidean norm$^{1}$, e.g., the $\ell_2$-norm, $\Omega(\mathbf{w}) = \|\mathbf{w}\|_2^2$, or the $\ell_1$-norm, $\Omega(\mathbf{w}) = \|\mathbf{w}\|_1$. For the loss functions, the most relevant measure is the training error, which however is not easy to optimize. A convex surrogate loss is used instead, which normally upper bounds the training error. Two popular examples are the hinge loss and logistic loss$^{2}$:

$$\mathcal{R}_h(\mathcal{D}; \theta) = \sum_{n=1}^N \max\left(0, \ell - y_n f(\mathbf{x}_n; \theta)\right),$$

$$\mathcal{R}_l(\mathcal{D}; \theta) = \sum_{n=1}^N -\log p(y_n | \mathbf{x}_n, \theta),$$

where $\ell$ ($\geq 1$) is the required margin, and $p(y_n | \mathbf{x}_n, \theta) = 1/(1 + \exp(-y_n f(\mathbf{x}_n; \theta)))$ is the logistic likelihood. Other losses include the quadratic loss, $\sum_{n=1}^N (f(\mathbf{x}_n; \theta) - y_n)^2$, and the exponential loss, $\sum_{n=1}^N \exp(-y_n f(\mathbf{x}_n; \theta))$, whose feature noising analyses are relatively simpler [41].

3.2 Learning with marginalized corruption 

Let $\tilde{\mathbf{x}}$ be the corrupted version of the input features $\mathbf{x}$. Consider the commonly used independent corrupting model:

$$p(\tilde{\mathbf{x}} | \mathbf{x}) = \prod_{d=1}^D p(\tilde{x}_d | x_d; \eta_d),$$

where each individual distribution is a member of the exponential family, with the natural parameter $\eta_d$. Another common assumption is that the corrupting distribution is unbiased, that is, $\mathbb{E}_p[\tilde{\mathbf{x}} | \mathbf{x}] = \mathbf{x}$, where we use $\mathbb{E}_p[\cdot] = \mathbb{E}_{p(\tilde{\mathbf{x}}|\mathbf{x})}[\cdot]$ to denote the expectation taken over the corrupting distribution $p(\tilde{\mathbf{x}} | \mathbf{x})$. Such examples include the unbiased blankout (or dropout) noise, Gaussian noise, Laplace noise, and Poisson noise.

1. It is a common practice to not regularize the offset.

2. The natural logarithm is not an upper bound of the training error. We can simply change the base without affecting learning.

For explicit corruption, each example $(\mathbf{x}_n, y_n)$ is corrupted $M$ times from the corrupting model $p(\tilde{\mathbf{x}}_n | \mathbf{x}_n)$, resulting in the corrupted examples $(\tilde{\mathbf{x}}_{nm}, y_n)$, $m \in [M]$. This procedure generates a new corrupted dataset $\tilde{\mathcal{D}}$ with a larger size of $NM$. Then, the model can be learned on the generated dataset by minimizing the average loss function over the $M$ corrupted data points:

$$\mathcal{L}(\tilde{\mathcal{D}}; \theta) = \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \mathcal{R}(\tilde{\mathbf{x}}_{nm}, y_n; \theta), \tag{2}$$

where $\mathcal{R}(\mathbf{x}, y; \theta)$ is the loss function of the model incurred on the training example $(\mathbf{x}, y)$. As evaluating $\mathcal{L}(\tilde{\mathcal{D}}; \theta)$ scales linearly with the number of corrupted observations, this approach may suffer from a high computational cost whenever $M$ is moderately large.

Dropout training adopts the strategy of implicit corruption, which learns the model with marginalized corrupted features by minimizing the expectation of a loss function under the corrupting distribution

$$\mathcal{L}(\mathcal{D}; \theta) = \sum_{n=1}^N \mathbb{E}_p\left[\mathcal{R}(\tilde{\mathbf{x}}_n, y_n; \theta)\right]. \tag{3}$$

The objective can be seen as a limit case of (2) when $M \to \infty$, by the law of large numbers. Such an expectation scheme has been widely adopted [46], [41], [48], [47].
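To make the relation between explicit corruption (2) and the marginalized objective (3) concrete, the following minimal sketch compares the two for a single example under the quadratic loss, where the expectation has a closed form. It assumes unbiased dropout in which each feature is zeroed with probability `q` and otherwise rescaled by `1/(1-q)`, so that the per-feature variance is `q/(1-q) * x_d**2`; the function names and the toy numbers are illustrative and not part of the original method description.

```python
import random


def dropout_corrupt(x, q, rng):
    """Unbiased dropout: zero each feature with prob. q, else rescale by 1/(1-q)."""
    return [0.0 if rng.random() < q else xd / (1.0 - q) for xd in x]


def explicit_quadratic_loss(w, x, y, q, M, rng):
    """Eq. (2) for one example: average quadratic loss over M corrupted copies."""
    total = 0.0
    for _ in range(M):
        xt = dropout_corrupt(x, q, rng)
        pred = sum(wd * xd for wd, xd in zip(w, xt))
        total += (pred - y) ** 2
    return total / M


def marginalized_quadratic_loss(w, x, y, q):
    """Eq. (3) for one example: E_p[(w^T x~ - y)^2] in closed form.

    Uses E_p[x~] = x and Var_p[x~_d] = q / (1 - q) * x_d^2 (unbiased dropout).
    """
    pred = sum(wd * xd for wd, xd in zip(w, x))
    var = sum((wd * xd) ** 2 * q / (1.0 - q) for wd, xd in zip(w, x))
    return (pred - y) ** 2 + var


rng = random.Random(0)
w, x, y, q = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 1.0, 0.3
mc = explicit_quadratic_loss(w, x, y, q, M=200000, rng=rng)
exact = marginalized_quadratic_loss(w, x, y, q)
print(f"Monte Carlo estimate: {mc:.3f}, marginalized value: {exact:.3f}")
```

The Monte Carlo estimate needs many corrupted copies to approach the value that the marginalized form returns in a single pass, which is the computational argument for implicit corruption.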

The choice of the loss function can make a significant difference in terms of computational cost and prediction accuracy. Previous work on feature noising has covered the quadratic loss, exponential loss, logistic loss, and the loss induced from generalized linear models (GLM). For the quadratic loss and exponential loss, the expectation in Eq. (3) can be computed analytically, thereby leading to simple gradient descent algorithms [41]. However, there is no closed form for the expectation of the logistic loss or the GLM log-loss. Previous analysis has resorted to approximation methods, such as using the second-order Taylor expansion [46] or an upper bound obtained by applying Jensen’s inequality [41], both of which lead to effective algorithms in practice. In contrast, little work has been done on the hinge loss, for which the expectation under corrupting distributions cannot be analytically computed either, therefore calling for new algorithms.

4 Dropout Support Vector Machines 

In this section, we present dropout training for both linear
SVMs and their nonlinear extension with representation
learning in the context of classification and regression.

4.1 Linear SVMs with Corrupting Noise 

For linear SVMs, the expected hinge loss can be written as

$$\mathcal{L}_h(\mathcal{D}; \theta) = \sum_{n=1}^N \mathbb{E}_p\left[\max(0, \zeta_n)\right], \tag{4}$$

where we define $\zeta_n = \ell - y_n(\mathbf{w}^\top \tilde{\mathbf{x}}_n)$.$^{3}$ Following the regularized loss minimization framework, we define the optimization problem of SVMs with marginalized corrupted features as

$$\min_{\theta} \; \|\mathbf{w}\|_2^2 + 2c \cdot \mathcal{L}_h(\mathcal{D}; \theta). \tag{5}$$

Below, we present a simple iteratively re-weighted least square (IRLS) algorithm to solve this problem. Our method consists of a variational bound of the expected loss and a simple algorithm that iteratively minimizes an expectation of a re-weighted quadratic loss. We also apply similar ideas to develop a simple IRLS algorithm for minimizing the expected logistic loss in Section 5, thereby allowing for a systematic comparison of the hinge loss with the logistic and quadratic losses in the context of feature noising.


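4.1.1 A variational bound with data augmentation

Since we do not have a closed-form expression of the expectation of the max function, it is intractable to directly solve problem (5). Here, we derive a variational upper bound based on a data augmentation formulation of the expected hinge loss. Specifically, let $\phi(y_n | \tilde{\mathbf{x}}_n, \theta) = \exp\{-2c \max(0, \zeta_n)\}$ be the unnormalized pseudo-likelihood$^{4}$ of the response variable for sample $n$. Then we have

$$2c \cdot \mathcal{L}_h(\mathcal{D}; \theta) = -\sum_n \mathbb{E}_p\left[\log \phi(y_n | \tilde{\mathbf{x}}_n, \theta)\right]. \tag{6}$$

Using the ideas of data augmentation [30], [49], the pseudo-likelihood can be expressed as

$$\phi(y_n | \tilde{\mathbf{x}}_n, \theta) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_n}} \exp\left\{-\frac{(\lambda_n + c\zeta_n)^2}{2\lambda_n}\right\} d\lambda_n, \tag{7}$$

where $\lambda_n$ is the augmented variable associated with data $n$. Using (7) and Jensen’s inequality, we can derive a variational upper bound of the expected hinge loss multiplied by the factor $2c$ (i.e., $2c \cdot \mathcal{L}_h(\mathcal{D}; \theta)$) as

$$\mathcal{L}_h(\theta, q(\boldsymbol{\lambda})) = -H(\boldsymbol{\lambda}) + \sum_{n=1}^N \left\{ \frac{1}{2}\mathbb{E}_q[\log \lambda_n] + \mathbb{E}_q\left[\frac{1}{2\lambda_n}\mathbb{E}_p\left[(\lambda_n + c\zeta_n)^2\right]\right] \right\} + c', \tag{8}$$

where $H(\boldsymbol{\lambda})$ is the entropy of the variational distribution $q(\boldsymbol{\lambda})$ with $\boldsymbol{\lambda} = \{\lambda_n\}_{n=1}^N$; $c'$ is a constant; and we have defined $\mathbb{E}_q[\cdot] = \mathbb{E}_{q(\boldsymbol{\lambda})}[\cdot]$ to denote the expectation taken over a variational distribution $q$. Now, our variational optimization problem is

$$\min_{\theta, \, q(\boldsymbol{\lambda}) \in \mathcal{P}} \; \|\mathbf{w}\|_2^2 + \mathcal{L}_h(\theta, q(\boldsymbol{\lambda})), \tag{9}$$

where $\mathcal{P}$ is the simplex space of normalized distributions. We should note that when there is no feature noise (i.e., $\tilde{\mathbf{x}} = \mathbf{x}$), the bound is tight and we are learning the standard SVM classifier. Please see Appendix A for the derivation. We will empirically compare with SVM in experiments.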
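The scale-mixture identity behind the data augmentation step, namely that exp(-2c max(0, ζ)) equals an integral of a Gaussian-type kernel over the augmented variable λ, can be checked numerically. The sketch below is only an illustration: the midpoint-rule integrator, the substitution λ = u², the truncation point `upper`, and the test values are our own choices, not part of the paper.

```python
import math


def hinge_pseudo_likelihood(zeta, c):
    """Left-hand side: exp(-2c * max(0, zeta))."""
    return math.exp(-2.0 * c * max(0.0, zeta))


def augmented_integral(zeta, c, upper=12.0, steps=20000):
    """Right-hand side: integral over lambda in (0, inf), midpoint rule.

    The substitution lambda = u^2 removes the 1/sqrt(lambda) singularity,
    turning the integrand into sqrt(2/pi) * exp(-(u^2 + c*zeta)^2 / (2 u^2)).
    The tail beyond `upper` is negligible for the values used here.
    """
    a = c * zeta
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += math.exp(-((u * u + a) ** 2) / (2.0 * u * u))
    return math.sqrt(2.0 / math.pi) * total * h


c = 0.7
checks = [(z, hinge_pseudo_likelihood(z, c), augmented_integral(z, c))
          for z in (-1.5, -0.2, 0.0, 0.4, 2.0)]
for z, lhs, rhs in checks:
    print(f"zeta={z:+.1f}  exp(-2c max(0,zeta))={lhs:.6f}  integral={rhs:.6f}")
```

Both columns agree for positive and negative ζ, which is what allows the non-smooth max function to be traded for a quadratic form in ζ once λ is introduced.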


3. We treat the offset $b$ implicitly by augmenting $\mathbf{x}_n$ and $\tilde{\mathbf{x}}_n$ with one
dimension of deterministic 1. More details will be given in the algorithm.

4. Pseudo-likelihood has been widely used in statistics. Here, we simply 
mean that the likelihood is not well normalized. 


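4.1.2 Iteratively Re-weighted Least Square Algorithm

In the upper bound, we note that when the variational distribution $q(\boldsymbol{\lambda})$ is given, the term $\mathbb{E}_p[(\lambda_n + c\zeta_n)^2]$ is an expectation of a quadratic loss, which can be analytically computed. We leverage this property and develop a coordinate descent algorithm to solve problem (9). Our algorithm iteratively solves the following two steps, analogous to the common procedure of a variational EM algorithm.

For $q(\boldsymbol{\lambda})$ (i.e., E-step): when the parameters $\theta$ are given, this step involves inferring the variational distribution $q(\boldsymbol{\lambda})$. Specifically, optimizing $\mathcal{L}_h$ over $q(\boldsymbol{\lambda})$, we get $q(\boldsymbol{\lambda}) = \prod_n q(\lambda_n)$ and each term is:

$$q(\lambda_n) \propto \frac{1}{\sqrt{2\pi\lambda_n}} \exp\left\{-\frac{\mathbb{E}_p\left[(\lambda_n + c\zeta_n)^2\right]}{2\lambda_n}\right\} = \mathcal{GIG}\left(\lambda_n; \tfrac{1}{2}, 1, c^2\mathbb{E}_p[\zeta_n^2]\right), \tag{10}$$

where the second-order expectation is

$$\mathbb{E}_p[\zeta_n^2] = \mathbf{w}^\top\left(\mathbb{E}_p[\tilde{\mathbf{x}}_n]\mathbb{E}_p[\tilde{\mathbf{x}}_n]^\top + V_p[\tilde{\mathbf{x}}_n]\right)\mathbf{w} - 2\ell y_n \mathbf{w}^\top \mathbb{E}_p[\tilde{\mathbf{x}}_n] + \ell^2; \tag{11}$$

and $V_p[\tilde{\mathbf{x}}_n]$ is a $D \times D$ diagonal matrix with the $d$th diagonal element being the variance of $\tilde{x}_{nd}$ under the corrupting distribution $p(\tilde{\mathbf{x}}_n | \mathbf{x}_n)$. We have denoted $\mathcal{GIG}(x; p, a, b) \propto x^{p-1}\exp\left(-\frac{1}{2}\left(\frac{b}{x} + ax\right)\right)$ as a generalized inverse Gaussian distribution [16]. Thus, $\lambda_n^{-1}$ follows an inverse Gaussian distribution

$$q(\lambda_n^{-1} | \tilde{\mathbf{x}}_n, \theta) = \mathcal{IG}\left(\lambda_n^{-1}; \frac{1}{c\sqrt{\mathbb{E}_p[\zeta_n^2]}}, 1\right). \tag{12}$$

For $\theta = \mathbf{w}$ (i.e., M-step): by fixing $q(\boldsymbol{\lambda})$ and removing irrelevant terms, this substep minimizes the objective:

$$\mathcal{L}_{[\theta]} = \|\mathbf{w}\|_2^2 + \sum_{n=1}^N \mathbb{E}_p\left[c\zeta_n + \frac{c^2}{2}\gamma_n\zeta_n^2\right], \tag{13}$$

where $\gamma_n = \mathbb{E}_q[\lambda_n^{-1}]$. We observe that this substep is equivalent to minimizing the expectation of a re-weighted quadratic loss, as summarized in Lemma 1, whose proof is deferred to Appendix B for brevity.

Lemma 1. Given $q(\boldsymbol{\lambda})$, the M-step minimizes the re-weighted quadratic loss (with the $\ell_2$-norm regularizer):

$$\|\mathbf{w}\|_2^2 + \frac{c^2}{2}\sum_n \gamma_n \mathbb{E}_p\left[(\mathbf{w}^\top\tilde{\mathbf{x}}_n - y_n^h)^2\right], \tag{14}$$

where $y_n^h = \left(\ell + \frac{1}{c\gamma_n}\right)y_n$ is the re-weighted label, and the re-weights are computed in closed-form:

$$\gamma_n = \mathbb{E}_q[\lambda_n^{-1}] = \frac{1}{c\sqrt{\mathbb{E}_p[\zeta_n^2]}}. \tag{15}$$

For low-dimensional data, we can do matrix inversion to get the closed-form solution$^{5}$

$$\mathbf{w} = \left(I + \frac{c^2}{2}\sum_{n=1}^N \gamma_n \mathbb{E}_p[\tilde{\mathbf{x}}_n\tilde{\mathbf{x}}_n^\top]\right)^{-1}\left(\frac{c^2}{2}\sum_{n=1}^N \gamma_n y_n^h \mathbb{E}_p[\tilde{\mathbf{x}}_n]\right), \tag{16}$$

where $\mathbb{E}_p[\tilde{\mathbf{x}}_n\tilde{\mathbf{x}}_n^\top] = \mathbb{E}_p[\tilde{\mathbf{x}}_n]\mathbb{E}_p[\tilde{\mathbf{x}}_n]^\top + V_p[\tilde{\mathbf{x}}_n]$. However, if the data lies in a high-dimensional space, e.g., text documents in a bag-of-words vector space with tens of thousands of dimensions, the above matrix inversion will be computationally expensive. In such cases, we can use numerical methods, e.g., the quasi-Newton method, to efficiently solve for $\theta$.

5. To consider the offset, we simply augment $\mathbf{x}$ and $\tilde{\mathbf{x}}$ with an additional unit of 1. The variance $V_p[\tilde{\mathbf{x}}_n]$ is augmented accordingly. The identity matrix $I$ is augmented by adding one zero row and one zero column.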
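As an illustration of the E-step quantities, the sketch below evaluates the second-order expectation of ζ_n and the resulting re-weight 1 / (c * sqrt(E_p[ζ_n²])) for one example, and compares the closed form with a Monte Carlo estimate. It assumes unbiased dropout with drop probability `q` (so each feature has variance `q/(1-q) * x_d**2`); the helper names and numbers are hypothetical.

```python
import math
import random


def second_moment_zeta(w, x, y, ell, q):
    """E_p[zeta^2] with zeta = ell - y * w^T x~ under unbiased dropout."""
    mean_score = sum(wd * xd for wd, xd in zip(w, x))                 # w^T E_p[x~]
    var_score = sum((wd * xd) ** 2 * q / (1.0 - q) for wd, xd in zip(w, x))
    return mean_score ** 2 + var_score - 2.0 * ell * y * mean_score + ell ** 2


def reweight(w, x, y, ell, q, c):
    """gamma = E_q[1/lambda] = 1 / (c * sqrt(E_p[zeta^2]))."""
    return 1.0 / (c * math.sqrt(second_moment_zeta(w, x, y, ell, q)))


def monte_carlo_second_moment(w, x, y, ell, q, samples, rng):
    """Sampling-based estimate of E_p[zeta^2], for comparison only."""
    total = 0.0
    for _ in range(samples):
        score = sum(wd * (0.0 if rng.random() < q else xd / (1.0 - q))
                    for wd, xd in zip(w, x))
        total += (ell - y * score) ** 2
    return total / samples


w, x, y = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], -1.0
exact = second_moment_zeta(w, x, y, ell=1.0, q=0.3)
approx = monte_carlo_second_moment(w, x, y, 1.0, 0.3, 100000, random.Random(1))
gamma = reweight(w, x, y, ell=1.0, q=0.3, c=2.0)
print(f"closed form: {exact:.4f}, Monte Carlo: {approx:.4f}, gamma: {gamma:.4f}")
```

Because the closed form only needs the mean and the per-feature variance of the corrupted input, the E-step costs no more than a pass over the uncorrupted data.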

To summarize, our algorithm iteratively minimizes the expectation of a simple re-weighted quadratic loss under the given corrupting distribution, where the re-weights $\gamma_n$ are computed in an analytical form. Therefore, it is an extension of the classical iteratively re-weighted least square (IRLS) algorithm to deal with dropout training. We also observe that if we fix $\gamma_n$ at $\frac{1}{c}$ and set $\ell = 0$, we are minimizing the quadratic loss under the corrupting distribution, as studied in prior work. We will empirically show that our iterative algorithm for the expected hinge loss consistently improves over the standard quadratic loss by adaptively updating $\gamma_n$. Finally, as we assume that the corrupting distribution is unbiased, i.e., $\mathbb{E}_p[\tilde{x}_{nd} | x_{nd}] = x_{nd}$, we only need to compute the variance of the corrupting distribution, e.g., $V_p[\tilde{x}_{nd}] = \frac{q}{1-q}x_{nd}^2$ for the unbiased dropout distribution with dropout level $q$, which is easy for all the existing exponential family distributions. An overview of the variance of the commonly used corrupting distributions can be found in [41].
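The alternating procedure described above can be sketched end to end for low-dimensional data. This is a minimal illustration, not the authors' implementation: it assumes unbiased dropout with drop probability `q`, omits the offset, uses the re-weighted label (ℓ + 1/(cγ_n)) y_n and the normal equations obtained from the objective ||w||² + (c²/2) Σ_n γ_n E_p[(wᵀx̃_n - y_n^h)²], and adds a small numerical floor on E_p[ζ_n²] that is our own safeguard. The tiny Gaussian-elimination solver and the toy data are likewise illustrative.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def dropout_svm_irls(X, y, c=1.0, ell=1.0, q=0.3, iters=30):
    """IRLS sketch for a linear SVM with marginalized unbiased dropout noise."""
    D = len(X[0])
    v = q / (1.0 - q)                      # Var_p[x~_d] = v * x_d^2
    w = [0.0] * D
    for _ in range(iters):
        A = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]
        b = [0.0] * D
        for xn, yn in zip(X, y):
            # E-step: gamma_n = 1 / (c * sqrt(E_p[zeta_n^2]))
            score = sum(wd * xd for wd, xd in zip(w, xn))
            var = sum(v * (wd * xd) ** 2 for wd, xd in zip(w, xn))
            m2 = max((ell - yn * score) ** 2 + var, 1e-12)   # floor: our safeguard
            gamma = 1.0 / (c * math.sqrt(m2))
            yh = (ell + 1.0 / (c * gamma)) * yn              # re-weighted label
            # M-step: accumulate the normal equations of the weighted least squares
            coef = 0.5 * c * c * gamma
            for i in range(D):
                b[i] += coef * yh * xn[i]
                A[i][i] += coef * v * xn[i] ** 2             # diagonal variance term
                for j in range(D):
                    A[i][j] += coef * xn[i] * xn[j]
        w = solve(A, b)
    return w


X = [[2.0, 2.0], [1.5, 2.5], [2.5, 1.0], [-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.0]]
y = [1, 1, 1, -1, -1, -1]
w = dropout_svm_irls(X, y)
margins = [yn * sum(wd * xd for wd, xd in zip(w, xn)) for xn, yn in zip(X, y)]
print("w =", w, "min margin =", min(margins))
```

For high-dimensional inputs the dense solve would be replaced by a numerical optimizer, as the text notes; the structure of the loop stays the same.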


4.2 Dropout SVMs with Representation Learning 

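We have assumed that the classifier is a linear model with respect to the input features. This assumption can be relaxed by learning a nonlinear representation, as popularized in representation learning. Here we present an extension to learn a nonlinear mapping of the input features.

Let $\mathbf{g}(\mathbf{x}; \boldsymbol{\alpha})$ denote a $K$-dimensional nonlinear transformation of the $D$-dimensional input features $\mathbf{x}$, parameterized by a $D \times K$ matrix $\boldsymbol{\alpha}$. For example, we can define the logistic transformation, where each element $k$ is

$$g_k(\mathbf{x}; \boldsymbol{\alpha}) = \mathrm{Sigmoid}(\boldsymbol{\alpha}_k^\top\mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\alpha}_k^\top\mathbf{x})},$$

where $\boldsymbol{\alpha}_k$ is the $k$th column of $\boldsymbol{\alpha}$. We then define our linear discriminant function$^{6}$ as

$$f(\mathbf{x}; \mathbf{w}, \boldsymbol{\alpha}) = \mathbf{w}^\top\mathbf{g}(\mathbf{x}; \boldsymbol{\alpha}),$$

where $\mathbf{w} \in \mathbb{R}^K$ is the vector of classifier weights. We still let $\zeta_n = \ell - y_n(\mathbf{w}^\top\mathbf{g}(\tilde{\mathbf{x}}_n; \boldsymbol{\alpha}))$. Then, we have the same expected hinge loss as in Eq. (4).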

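Using the same data augmentation technique, we can derive a variational upper bound of the expected hinge loss as in Eq. (8), again with the new definition of $\zeta_n$. However, note that the variational bound is also a function of $\boldsymbol{\alpha}$. With the nonlinear transformation, the challenge is in computing the variational bound, which is intractable in general. Here, we apply a first-order Taylor expansion of $\mathbf{g}(\cdot)$ in order to get an approximation. Specifically, we have

$$\mathbf{g}(\tilde{\mathbf{x}}_n) \approx \mathbf{g}'(\tilde{\mathbf{x}}_n) = \mathbf{g}(\boldsymbol{\mu}_n) + \nabla_{\mathbf{x}}\mathbf{g}(\boldsymbol{\mu}_n)^\top(\tilde{\mathbf{x}}_n - \boldsymbol{\mu}_n), \tag{17}$$

where $\nabla_{\mathbf{x}}\mathbf{g}(\boldsymbol{\mu}_n)$ is the first-order derivative of $\mathbf{g}(\mathbf{x}; \boldsymbol{\alpha})$ with respect to $\mathbf{x}$ evaluated at $\boldsymbol{\mu}_n$, a $D \times K$ matrix with each element being $\nabla_{\mathbf{x}}g_{dk}(\boldsymbol{\mu}_n) = g_k(\boldsymbol{\mu}_n)(1 - g_k(\boldsymbol{\mu}_n))\alpha_{dk}$; and $\boldsymbol{\mu}_n = \mathbb{E}_p[\tilde{\mathbf{x}}_n]$ is the mean of the corrupted features. For unbiased corrupting noise, we have $\boldsymbol{\mu}_n = \mathbf{x}_n$.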
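The effect of the first-order Taylor expansion can be illustrated for a single sigmoid unit. Because the linearized feature is affine in the corrupted input, its mean equals the feature evaluated at the mean input and its variance is a weighted sum of the per-feature corruption variances. The sketch below verifies both statements by exact enumeration of all dropout masks; it assumes unbiased dropout with drop probability `q`, and the weights and inputs are arbitrary illustrative values.

```python
import itertools
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dropout_outcomes(x, q):
    """Enumerate every unbiased-dropout corruption of x with its probability."""
    for mask in itertools.product([0, 1], repeat=len(x)):    # 1 = feature kept
        prob = 1.0
        for m in mask:
            prob *= (1.0 - q) if m else q
        yield prob, [m * xd / (1.0 - q) for m, xd in zip(mask, x)]


def linearized_feature(xt, mu, a):
    """First-order Taylor expansion of sigmoid(a^T x) around mu."""
    g = sigmoid(sum(ad * md for ad, md in zip(a, mu)))
    grad = [g * (1.0 - g) * ad for ad in a]                  # d sigmoid / d x_d
    return g + sum(gd * (xd - md) for gd, xd, md in zip(grad, xt, mu))


x, a, q = [1.0, 2.0, 0.5], [0.4, -0.3, 0.8], 0.2
mu = x                                                       # unbiased noise: mean is x
g_mu = sigmoid(sum(ad * md for ad, md in zip(a, mu)))

mean_lin = sum(p * linearized_feature(xt, mu, a) for p, xt in dropout_outcomes(x, q))
var_lin = sum(p * (linearized_feature(xt, mu, a) - mean_lin) ** 2
              for p, xt in dropout_outcomes(x, q))
var_formula = sum((q / (1.0 - q)) * xd ** 2 * (g_mu * (1.0 - g_mu) * ad) ** 2
                  for xd, ad in zip(x, a))
mean_true = sum(p * sigmoid(sum(ad * xd for ad, xd in zip(a, xt)))
                for p, xt in dropout_outcomes(x, q))
print(f"g(mu)={g_mu:.4f}  E[g']={mean_lin:.4f}  E[g]={mean_true:.4f}")
print(f"Var[g']={var_lin:.6f}  formula={var_formula:.6f}")
```

The printed exact mean of the nonlinear feature differs slightly from g(μ), which gives a sense of the approximation error the linearization introduces.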

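With the first-order Taylor expansion, we can compute the variational bound, which involves the variance of the corruption. Then, an alternating minimization algorithm can be developed to iteratively update the following steps:

For $q(\boldsymbol{\lambda})$: the solution has the same form as in (10):

$$q(\lambda_n) = \mathcal{GIG}\left(\lambda_n; \tfrac{1}{2}, 1, c^2\mathbb{E}_p[\zeta_n^2]\right),$$

but with the new definition of $\zeta_n$. Under the Taylor expansion, we have the second-order expectation:

$$\mathbb{E}_p[\zeta_n^2] = \mathbf{w}^\top\left(\mathbb{E}_p[\mathbf{g}'(\tilde{\mathbf{x}}_n)\mathbf{g}'(\tilde{\mathbf{x}}_n)^\top]\right)\mathbf{w} - 2\ell y_n\mathbf{w}^\top\mathbb{E}_p[\mathbf{g}'(\tilde{\mathbf{x}}_n)] + \ell^2,$$

where $\mathbb{E}_p[\mathbf{g}'(\tilde{\mathbf{x}}_n)] = \mathbf{g}(\boldsymbol{\mu}_n)$, $\mathbb{E}_p[\mathbf{g}'(\tilde{\mathbf{x}}_n)\mathbf{g}'(\tilde{\mathbf{x}}_n)^\top] = \mathbf{g}(\boldsymbol{\mu}_n)\mathbf{g}(\boldsymbol{\mu}_n)^\top + \nabla_{\mathbf{x}}\mathbf{g}(\boldsymbol{\mu}_n)^\top V_p[\tilde{\mathbf{x}}_n]\nabla_{\mathbf{x}}\mathbf{g}(\boldsymbol{\mu}_n)$, and $V_p[\tilde{\mathbf{x}}_n]$ is again a $D \times D$ diagonal matrix with the $d$th diagonal element being the variance of $\tilde{x}_{nd}$ under the corrupting distribution $p(\tilde{\mathbf{x}}_n | \mathbf{x}_n)$.

For $\mathbf{w}$: this step is similar to the linear case, but with a subtle change in the features. For ease of computation, we keep the objective that only includes $\mathbf{w}$:

$$\mathcal{L}_{[\mathbf{w}]} = \|\mathbf{w}\|_2^2 + \sum_{n=1}^N \mathbb{E}_p\left[c\zeta_n + \frac{c^2}{2}\gamma_n\zeta_n^2\right], \tag{18}$$

where the coefficient $\gamma_n$ is computed as:

$$\gamma_n = \mathbb{E}_q[\lambda_n^{-1}] = \frac{1}{c\sqrt{\mathbb{E}_p[\zeta_n^2]}}. \tag{19}$$

If the number of latent features is not large (i.e., $K$ is small), we can get the optimal solution in the same closed form as (16), by simply replacing $\tilde{\mathbf{x}}_n$ with $\mathbf{g}'(\tilde{\mathbf{x}}_n)$. However, if $K$ is large, we must resort to numerical methods (e.g., quasi-Newton methods) and use the derivative for $\mathbf{w}$:

$$\frac{\partial\mathcal{L}_{[\mathbf{w}]}}{\partial w_k} = 2w_k + c^2\sum_n \gamma_n\left((\hat{y}_n - y_n^h)g_k(\boldsymbol{\mu}_n) + \sum_d V_p[\tilde{x}_{nd}]h_{nd}\nabla_{\mathbf{x}}g_{dk}(\boldsymbol{\mu}_n)\right),$$

where $\hat{y}_n = \sum_k w_k g_k(\boldsymbol{\mu}_n)$, $y_n^h = \left(\ell + \frac{1}{c\gamma_n}\right)y_n$ and $h_{nd} = \sum_k w_k\nabla_{\mathbf{x}}g_{dk}(\boldsymbol{\mu}_n)$.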
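A gradient of this kind is easy to get wrong in an implementation, so a finite-difference check is a useful safeguard. The sketch below assumes the w-dependent part of the objective can be written, up to a constant, as ||w||² + (c²/2) Σ_n γ_n [(ŷ_n - y_n^h)² + Σ_d V_nd h_nd²] with the re-weights and re-weighted labels held fixed, where ŷ_n = wᵀg(μ_n) and h_nd = Σ_k w_k ∇g_dk(μ_n); it further assumes unbiased dropout, so that V_nd = v·μ_nd² with v = q/(1-q). All names and numbers are illustrative.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def features(mu, alpha):
    """Return g_k(mu) and the Jacobian grad[d][k] = g_k (1 - g_k) alpha[d][k]."""
    D, K = len(alpha), len(alpha[0])
    g = [sigmoid(sum(alpha[d][k] * mu[d] for d in range(D))) for k in range(K)]
    grad = [[g[k] * (1.0 - g[k]) * alpha[d][k] for k in range(K)] for d in range(D)]
    return g, grad


def objective(w, data, alpha, c, v):
    """||w||^2 + c^2/2 * sum_n gamma_n * E_p[(w^T g'(x~_n) - yh_n)^2]."""
    total = sum(wk * wk for wk in w)
    for mu, gamma, yh in data:
        g, grad = features(mu, alpha)
        yhat = sum(wk * gk for wk, gk in zip(w, g))
        quad = (yhat - yh) ** 2
        for d in range(len(mu)):
            h = sum(w[k] * grad[d][k] for k in range(len(w)))
            quad += v * mu[d] ** 2 * h * h          # variance contribution
        total += 0.5 * c * c * gamma * quad
    return total


def gradient(w, data, alpha, c, v):
    """Analytic derivative of the objective above with respect to each w_k."""
    K = len(w)
    out = [2.0 * wk for wk in w]
    for mu, gamma, yh in data:
        g, grad = features(mu, alpha)
        yhat = sum(wk * gk for wk, gk in zip(w, g))
        hs = [sum(w[k] * grad[d][k] for k in range(K)) for d in range(len(mu))]
        for k in range(K):
            term = (yhat - yh) * g[k]
            term += sum(v * mu[d] ** 2 * hs[d] * grad[d][k] for d in range(len(mu)))
            out[k] += c * c * gamma * term
    return out


alpha = [[0.3, -0.7], [0.5, 0.2]]                   # D = 2 inputs, K = 2 hidden units
data = [([1.0, 2.0], 0.8, 1.9), ([-0.5, 1.5], 1.2, -1.6)]   # (mu_n, gamma_n, yh_n)
w, c, v, step = [0.4, -0.9], 1.5, 0.3 / 0.7, 1e-5
analytic = gradient(w, data, alpha, c, v)
numeric = []
for k in range(len(w)):
    wp, wm = w[:], w[:]
    wp[k] += step
    wm[k] -= step
    numeric.append((objective(wp, data, alpha, c, v) - objective(wm, data, alpha, c, v))
                   / (2.0 * step))
print("analytic:", analytic)
print("numeric: ", numeric)
```

Since the objective is quadratic in the classifier weights, the central difference matches the analytic gradient up to rounding error.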

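For $\boldsymbol{\alpha}$: this is the new step, which can be done by gradient descent. The objective including all the terms in $\boldsymbol{\alpha}$ is

$$\mathcal{L}_{[\boldsymbol{\alpha}]} = \sum_{n=1}^N \mathbb{E}_p\left[c\zeta_n + \frac{c^2}{2}\gamma_n\zeta_n^2\right], \tag{20}$$

and $\boldsymbol{\alpha}$ can then be solved for using gradient descent. The gradient for $\alpha_{dk}$ is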


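$$\frac{\partial\mathcal{L}_{[\boldsymbol{\alpha}]}}{\partial\alpha_{dk}} = c^2\sum_n \gamma_n\left((\hat{y}_n - y_n^h)\mu_{nd} + V_p[\tilde{x}_{nd}]h_{nd}p_{kd}\right)w_k r_{nk},$$

where the coefficients are computed as $p_{kd} = 1 + (1 - 2g_k(\boldsymbol{\mu}_n))\alpha_{dk}\mu_{nd}$ and $r_{nk} = g_k(\boldsymbol{\mu}_n)(1 - g_k(\boldsymbol{\mu}_n))$.

6. The offset is again ignored for simplicity.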

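In summary, the nonlinear SVMs can be learned via coordinate descent by iteratively updating $\boldsymbol{\gamma}$ with Eq. (19), solving for the model parameters $\mathbf{w}$ by minimizing $\mathcal{L}_{[\mathbf{w}]}$ in (18), and solving for the transformation weights $\boldsymbol{\alpha}$ by minimizing $\mathcal{L}_{[\boldsymbol{\alpha}]}$ in (20).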


4.3 Dropout SVMs for Regression 

We briefly discuss how to extend the above ideas to the regression task, where the response variable $Y$ takes real values. For regression, a widely used loss for support vector regression (SVR) models [35] is the $\epsilon$-insensitive loss:

$$\mathcal{R}_\epsilon(\mathcal{D}; \theta) = \sum_{n=1}^N \max(0, |\Delta_n| - \epsilon), \tag{21}$$

where $\Delta_n = y_n - \mathbf{w}^\top\tilde{\mathbf{x}}_n$ is the difference between the true value and the model prediction, and $\epsilon$ is a pre-defined positive constant. For dropout training, we consider the expected loss $\mathcal{L}_\epsilon = \mathbb{E}_p[\mathcal{R}_\epsilon]$. To deal with the intractability, we develop a similar IRLS algorithm with data augmentation. Specifically, let $\varphi(y_n | \tilde{\mathbf{x}}_n, \theta) = \exp\{-2c\max(0, |\Delta_n| - \epsilon)\}$ be the pseudo-likelihood of the response variable for sample $n$. We have

$$2c \cdot \mathcal{L}_\epsilon(\mathcal{D}; \theta) = -\sum_n \mathbb{E}_p\left[\log\varphi(y_n | \tilde{\mathbf{x}}_n, \theta)\right]. \tag{22}$$

By noting the equality $\max(0, |\Delta_n| - \epsilon) = \max(0, \Delta_n - \epsilon) + \max(0, -\Delta_n - \epsilon)$ and applying the ideas of data augmentation in Eq. (7), we have:


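$$\varphi(y_n | \tilde{\mathbf{x}}_n, \theta) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_n}}\exp\left\{-\frac{(\lambda_n + c(\Delta_n - \epsilon))^2}{2\lambda_n}\right\}d\lambda_n \times \int_0^\infty \frac{1}{\sqrt{2\pi\omega_n}}\exp\left\{-\frac{(\omega_n - c(\Delta_n + \epsilon))^2}{2\omega_n}\right\}d\omega_n,$$

where $(\lambda_n, \omega_n)$ are a pair of augmented variables associated with data $n$. Then, using this data augmentation expression and Jensen’s inequality, we can derive a variational upper bound $\mathcal{L}_\epsilon$ of the expected $\epsilon$-insensitive loss, again multiplied by the factor $2c$ (i.e., $2c \cdot \mathcal{L}_\epsilon(\mathcal{D}; \theta)$), as

$$\mathcal{L}_\epsilon(\theta, q(\boldsymbol{\lambda}, \boldsymbol{\omega})) = -H(\boldsymbol{\lambda}, \boldsymbol{\omega}) + \sum_{n=1}^N \left\{ \frac{1}{2}\mathbb{E}_q[\log\lambda_n + \log\omega_n] + \mathbb{E}_q\left[\frac{1}{2\lambda_n}\mathbb{E}_p\left[(\lambda_n + c(\Delta_n - \epsilon))^2\right]\right] + \mathbb{E}_q\left[\frac{1}{2\omega_n}\mathbb{E}_p\left[(\omega_n - c(\Delta_n + \epsilon))^2\right]\right] \right\} + c',$$

where $H(\boldsymbol{\lambda}, \boldsymbol{\omega})$ is the entropy of the variational distribution $q(\boldsymbol{\lambda}, \boldsymbol{\omega})$. Then the optimization problem for marginalized corrupted SVR is

$$\min_{\theta, \, q(\boldsymbol{\lambda}, \boldsymbol{\omega}) \in \mathcal{P}} \; \|\mathbf{w}\|_2^2 + \mathcal{L}_\epsilon(\theta, q(\boldsymbol{\lambda}, \boldsymbol{\omega})). \tag{23}$$

In the upper bound, we note that when the variational distribution $q(\boldsymbol{\lambda}, \boldsymbol{\omega})$ is given, the term $\mathbb{E}_p[(\lambda_n + c(\Delta_n - \epsilon))^2]$ is an expectation of a quadratic loss, which can be analytically computed. Similar to the classification case, we develop an IRLS algorithm for problem (23) with the following two steps.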

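For $q(\boldsymbol{\lambda}, \boldsymbol{\omega})$ (i.e., E-step): infer the variational distribution $q(\boldsymbol{\lambda}, \boldsymbol{\omega})$. Optimizing $\mathcal{L}_\epsilon$ over $q(\boldsymbol{\lambda}, \boldsymbol{\omega})$, we get $q(\boldsymbol{\lambda}, \boldsymbol{\omega}) = \prod_n q(\lambda_n)q(\omega_n)$ and each term is:

$$q(\lambda_n) = \mathcal{GIG}\left(\lambda_n; \tfrac{1}{2}, 1, c^2\mathbb{E}_p[(\Delta_n - \epsilon)^2]\right),$$

$$q(\omega_n) = \mathcal{GIG}\left(\omega_n; \tfrac{1}{2}, 1, c^2\mathbb{E}_p[(\Delta_n + \epsilon)^2]\right),$$

where the second-order expectations are $\mathbb{E}_p[(\Delta_n - \epsilon)^2] = \mathbf{w}^\top(\mathbb{E}_p[\tilde{\mathbf{x}}_n]\mathbb{E}_p[\tilde{\mathbf{x}}_n]^\top + V_p[\tilde{\mathbf{x}}_n])\mathbf{w} - 2(y_n - \epsilon)\mathbf{w}^\top\mathbb{E}_p[\tilde{\mathbf{x}}_n] + (y_n - \epsilon)^2$ and $\mathbb{E}_p[(\Delta_n + \epsilon)^2] = \mathbf{w}^\top(\mathbb{E}_p[\tilde{\mathbf{x}}_n]\mathbb{E}_p[\tilde{\mathbf{x}}_n]^\top + V_p[\tilde{\mathbf{x}}_n])\mathbf{w} - 2(y_n + \epsilon)\mathbf{w}^\top\mathbb{E}_p[\tilde{\mathbf{x}}_n] + (y_n + \epsilon)^2$. Thus, $\lambda_n^{-1}$ and $\omega_n^{-1}$ follow inverse Gaussian distributions:

$$q(\lambda_n^{-1} | \tilde{\mathbf{x}}_n, \theta) = \mathcal{IG}\left(\lambda_n^{-1}; \frac{1}{c\sqrt{\mathbb{E}_p[(\Delta_n - \epsilon)^2]}}, 1\right), \tag{24}$$

$$q(\omega_n^{-1} | \tilde{\mathbf{x}}_n, \theta) = \mathcal{IG}\left(\omega_n^{-1}; \frac{1}{c\sqrt{\mathbb{E}_p[(\Delta_n + \epsilon)^2]}}, 1\right). \tag{25}$$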

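For $\theta = \mathbf{w}$ (i.e., M-step): removing the irrelevant terms, this step involves minimizing the following objective:

$$\mathcal{L}_{[\theta]} = \|\mathbf{w}\|_2^2 + \frac{c^2}{2}\sum_{n=1}^N \mathbb{E}_p\left[\gamma_n(\Delta_n - \epsilon)^2 + \delta_n(\Delta_n + \epsilon)^2\right],$$

where $\gamma_n = \mathbb{E}_q[\lambda_n^{-1}]$ and $\delta_n = \mathbb{E}_q[\omega_n^{-1}]$. Similar to Lemma 1, it can be observed that this substep is equivalent to solving a re-weighted quadratic loss, as summarized in Lemma 2.

Lemma 2. Given $q(\boldsymbol{\lambda}, \boldsymbol{\omega})$, the M-step minimizes the re-weighted quadratic loss (with the $\ell_2$-norm regularizer):

$$\|\mathbf{w}\|_2^2 + \frac{c^2}{2}\sum_n (\gamma_n + \delta_n)\mathbb{E}_p\left[(\mathbf{w}^\top\tilde{\mathbf{x}}_n - y_n^r)^2\right], \tag{26}$$

where $y_n^r = y_n + \frac{\delta_n - \gamma_n}{\gamma_n + \delta_n}\epsilon$ is the re-weighted response, and the re-weights are computed in closed-form:

$$\gamma_n = \frac{1}{c\sqrt{\mathbb{E}_p[(\Delta_n - \epsilon)^2]}}, \qquad \delta_n = \frac{1}{c\sqrt{\mathbb{E}_p[(\Delta_n + \epsilon)^2]}}.$$

Similar to the classification case, we can solve for the closed-form solution by using matrix inversion for low-dimensional data, while for high-dimensional data, we must resort to numerical approaches.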
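The reduction of the two-sided weighted quadratic to a single re-weighted quadratic can be checked by completing the square. The sketch below assumes that the re-weighted response takes the form y + ε(δ - γ)/(γ + δ) and that the combined weight is γ + δ; these expressions follow from our own algebra with Δ = y - prediction and should be compared against Lemma 2. The numbers are arbitrary.

```python
def reweighted_response(y, eps, gamma, delta):
    """Target of the single quadratic obtained by completing the square."""
    return y + eps * (delta - gamma) / (gamma + delta)


def two_sided_quadratic(pred, y, eps, gamma, delta):
    """gamma * (Delta - eps)^2 + delta * (Delta + eps)^2 with Delta = y - pred."""
    d = y - pred
    return gamma * (d - eps) ** 2 + delta * (d + eps) ** 2


def single_quadratic(pred, y, eps, gamma, delta):
    """(gamma + delta) * (pred - y_r)^2 with the re-weighted response y_r."""
    yr = reweighted_response(y, eps, gamma, delta)
    return (gamma + delta) * (pred - yr) ** 2


y, eps, gamma, delta = 1.3, 0.2, 0.7, 1.9
preds = [-2.0, -0.5, 0.0, 0.8, 1.3, 3.1]
# If the two forms differ only by a constant, these gaps are all identical.
gaps = [two_sided_quadratic(p, y, eps, gamma, delta)
        - single_quadratic(p, y, eps, gamma, delta) for p in preds]
print("gaps:", [round(g, 10) for g in gaps])
```

A constant gap means both objectives share the same minimizer in the prediction, so the M-step can reuse the weighted least-squares machinery of the classification case.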


5 Dropout Logistic Regression 

In this section, we develop a new IRLS algorithm for dropout training of logistic regression and its extension to learn latent representations for classification. Our IRLS algorithm also iteratively minimizes the expectation of a re-weighted quadratic loss under the corrupting distribution and computes the re-weights analytically. Such an IRLS algorithm allows us to draw comparisons with SVMs.

5.1 Logistic Regression with Corrupting Noise

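Define $\omega_n = \mathbf{w}^\top\tilde{\mathbf{x}}_n$. The expected logistic loss under a corrupting distribution is

$$\mathcal{L}_l(\mathcal{D}; \mathbf{w}) = \sum_{n=1}^N \mathbb{E}_p\left[\log\left(1 + e^{-y_n\omega_n}\right)\right]. \tag{27}$$

Again, since the expectation cannot be computed in closed form, we derive a variational bound as a surrogate. Specifically, let $\psi(y_n | \tilde{\mathbf{x}}_n, \mathbf{w}) = \frac{e^{cy_n\omega_n}}{(1 + e^{y_n\omega_n})^c}$ be the pseudo-likelihood of the response variable for sample $n$. We have $c \cdot \mathcal{L}_l(\mathcal{D}; \mathbf{w}) = -\sum_n \mathbb{E}_p[\log\psi(y_n | \tilde{\mathbf{x}}_n, \mathbf{w})]$.$^{7}$ Using data augmentation techniques [29], the pseudo-likelihood can be expressed as

$$\psi(y_n | \tilde{\mathbf{x}}_n, \mathbf{w}) = \frac{1}{2^c}e^{\kappa y_n\omega_n}\int_0^\infty \exp\left(-\frac{\lambda_n\omega_n^2}{2}\right)p(\lambda_n)d\lambda_n, \tag{28}$$

where $\kappa = \frac{c}{2}$ and $\lambda_n$ is the augmented Polya-Gamma variable following distribution $p(\lambda_n) \sim \mathcal{PG}(\lambda_n; c, 0)$. Using (28), we can derive the upper bound of the expected logistic loss multiplied by the factor $c$ (i.e., $c \cdot \mathcal{L}_l(\mathcal{D}; \mathbf{w})$):

$$\mathcal{L}'_l(\mathbf{w}, q(\boldsymbol{\lambda})) = -H(\boldsymbol{\lambda}) + \sum_{n=1}^N \left\{ \frac{1}{2}\mathbb{E}_q[\lambda_n]\mathbb{E}_p[\omega_n^2] - \mathbb{E}_q[\log p(\lambda_n)] - \mathbb{E}_p[\kappa y_n\omega_n] \right\} + c', \tag{29}$$

and get the variational optimization problem

$$\min_{\mathbf{w}, \, q(\boldsymbol{\lambda}) \in \mathcal{P}} \; \|\mathbf{w}\|_2^2 + \mathcal{L}'_l(\mathbf{w}, q(\boldsymbol{\lambda})), \tag{30}$$

where $q(\boldsymbol{\lambda})$ is the variational distribution.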
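The Polya-Gamma augmentation can be sanity-checked without sampling, because the Laplace transform of a PG(c, 0) variable is known in closed form: E[exp(-λt)] = cosh(sqrt(t/2))^(-c). The sketch below uses this fact to confirm that the augmented expression reproduces the c-th power of the logistic likelihood, and also evaluates the mean of a tilted PG(b, s) variable, b/(2s)·tanh(s/2), which is the kind of quantity an E-step for this model relies on. The function names, the small-argument cutoff, and the test grid are illustrative assumptions.

```python
import math


def logistic_pseudo_likelihood(y, omega, c):
    """psi = p(y | x, w)^c with the logistic likelihood 1 / (1 + exp(-y * omega))."""
    return (1.0 / (1.0 + math.exp(-y * omega))) ** c


def augmented_form(y, omega, c):
    """2^(-c) * exp(kappa * y * omega) * E_{PG(c,0)}[exp(-lambda * omega^2 / 2)].

    The expectation is evaluated with the PG Laplace transform at t = omega^2 / 2,
    which gives cosh(|omega| / 2)^(-c).
    """
    kappa = c / 2.0
    pg_integral = math.cosh(abs(omega) / 2.0) ** (-c)
    return 2.0 ** (-c) * math.exp(kappa * y * omega) * pg_integral


def pg_mean(b, s):
    """Mean of PG(b, s): b / (2 s) * tanh(s / 2), with limit b / 4 as s -> 0."""
    if abs(s) < 1e-8:
        return b / 4.0
    return b / (2.0 * s) * math.tanh(s / 2.0)


pairs = [(logistic_pseudo_likelihood(y, om, c), augmented_form(y, om, c))
         for y in (-1, 1) for om in (-3.0, -0.4, 0.0, 1.2, 2.5) for c in (0.5, 2.0)]
worst = max(abs(a - b) for a, b in pairs)
print(f"largest discrepancy: {worst:.2e}")
print(f"PG mean at s=1, b=2: {pg_mean(2.0, 1.0):.5f}")
```

The mean shrinks as s grows, so examples with a large second moment of the score receive smaller weights in the quadratic surrogate.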

We solve the variational problem with a coordinate 
descent algorithm as follows: 

For $q(\lambda)$ (i.e., E-step): optimizing $\mathcal{L}'$ over $q(\lambda)$, we have $q(\lambda) = \prod_n q(\lambda_n)$ and each term is:

$$q(\lambda_n) \propto \exp\left(-\frac{1}{2} \lambda_n E_p[\omega_n^2]\right) p(\lambda_n | c, 0) = \mathcal{PG}\left(\lambda_n; c, \sqrt{E_p[\omega_n^2]}\right), \qquad (31)$$

which is a Polya-Gamma distribution with $E_p[\omega_n^2] = \mathbf{w}^\top (E_p[\tilde{\mathbf{x}}_n] E_p[\tilde{\mathbf{x}}_n]^\top + V_p[\tilde{\mathbf{x}}_n]) \mathbf{w}$.

For $\mathbf{w}$ (i.e., M-step): by fixing $q(\lambda)$ and removing irrelevant terms, this step minimizes the objective

$$\mathcal{L}_{[\mathbf{w}]} = \|\mathbf{w}\|_2^2 + \sum_{n=1}^{N} \left( \frac{1}{2} E_q[\lambda_n] E_p[\omega_n^2] - \frac{c}{2} y_n E_p[\omega_n] \right). \qquad (32)$$

If the data is not in a high-dimensional space, we can get the optimal solution in closed form:

$$\mathbf{w} = \left( I + \frac{1}{2} \sum_n E_q[\lambda_n] E_p[\tilde{\mathbf{x}}_n \tilde{\mathbf{x}}_n^\top] \right)^{-1} \left( \frac{c}{4} \sum_n y_n E_p[\tilde{\mathbf{x}}_n] \right). \qquad (33)$$

However, if the data is high-dimensional, we must resort to efficient numerical methods, as in SVMs.

The M-step is actually equivalent to minimizing the expectation of a re-weighted quadratic loss, as in Lemma 3. The proof is similar to that of Lemma 1, and the expectation of a Polya-Gamma distribution is available in closed form.

Lemma 3. Given $q(\lambda)$, the M-step minimizes the re-weighted quadratic loss (with the $\ell_2$-norm regularizer)

$$\|\mathbf{w}\|_2^2 + \frac{1}{2} \sum_n \gamma_n E_p\left[(\mathbf{w}^\top \tilde{\mathbf{x}}_n - y_n^l)^2\right], \qquad (34)$$

where $y_n^l = \frac{c}{2\gamma_n} y_n$ is the re-weighted label, and

$$\gamma_n = E_q[\lambda_n] = \frac{c}{2\sqrt{E_p[\omega_n^2]}} \times \frac{e^{\sqrt{E_p[\omega_n^2]}} - 1}{1 + e^{\sqrt{E_p[\omega_n^2]}}}. \qquad (35)$$

It can be observed that if we fix $\gamma_n = c/2$, the IRLS algorithm reduces to minimizing the expected quadratic loss under the corrupting distribution. This is similar to the case with SVMs, where if we set $\ell = 0$ and fix $\gamma_n = 1/c$, the IRLS algorithm for SVMs essentially minimizes the expected quadratic loss under the corrupting distribution. Furthermore, by sharing a similar iterative structure, our IRLS algorithms shed light on the similarity and difference between the hinge loss and the logistic loss, as summarized in Table 1. Specifically, both losses can be minimized via iteratively minimizing the expectation of a re-weighted quadratic loss, while they differ in the update rules of the weights $\gamma_n$ and the labels $y_n$ at each iteration.

7. We drop the constant factor 2 again for notation simplicity.

8. The offset can be similarly incorporated as in the hinge loss.


5.2 Dropout LR with Representation Learning

Similar to Section 4.2, we extend logistic regression (LR) to learn latent representations under the dropout learning context. Specifically, let $\mathbf{g}(\mathbf{x}; \alpha) \in \mathbb{R}^K$ denote the nonlinear transformation of the input features $\mathbf{x}$, parameterized by $\alpha$. For the D-dimensional input $\mathbf{x}$, K denotes the transformed feature dimension. We again consider the logistic transformation, where each element k is $g_k(\mathbf{x}; \alpha) = \mathrm{Sigmoid}(\alpha_k^\top \mathbf{x})$. We then define our linear discriminant function as

$$f(\mathbf{x}; \mathbf{w}, \alpha) = \mathbf{w}^\top \mathbf{g}(\mathbf{x}; \alpha),$$

where $\mathbf{w} \in \mathbb{R}^K$ is the classifier weights. We still let $\zeta_n = \mathbf{w}^\top \mathbf{g}(\tilde{\mathbf{x}}_n; \alpha)$. Then, we have the same expected logistic loss as in Eq. (27).
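Before turning to the nonlinear case, the linear updates of Section 5.1 (Eqs. 31-35) can be made concrete. The following is a minimal sketch under stated assumptions, not the authors' C++ implementation: labels are taken in {+1, -1}, the corruption is the unbiased dropout noise with variance $\frac{q}{1-q} x_d^2$ per feature, the M-step uses a dense closed-form solve, and the names `pg_mean` and `dropout_lr_irls` are hypothetical.

```python
import numpy as np

def pg_mean(c, z):
    """Mean of a Polya-Gamma PG(c, z) variable: c / (2 z) * tanh(z / 2)."""
    z = np.maximum(z, 1e-8)  # guard the z -> 0 limit, where the mean tends to c / 4
    return c / (2.0 * z) * np.tanh(z / 2.0)

def dropout_lr_irls(X, y, c=1.0, q=0.5, n_iter=20):
    """Sketch of IRLS for dropout logistic regression (labels assumed in {+1, -1})."""
    N, D = X.shape
    var = q / (1.0 - q) * X ** 2  # assumed unbiased dropout variance, shape (N, D)
    w = np.zeros(D)
    for _ in range(n_iter):
        # E-step: E_p[omega_n^2] = (w^T x_n)^2 + sum_d V_p[x_nd] * w_d^2
        second_moment = (X @ w) ** 2 + var @ (w ** 2)
        gamma = pg_mean(c, np.sqrt(second_moment))
        # M-step: stationarity of
        # ||w||^2 + sum_n (gamma_n / 2 * E_p[omega_n^2] - c / 2 * y_n * E_p[omega_n])
        A = 2.0 * np.eye(D) + (X * gamma[:, None]).T @ X + np.diag(gamma @ var)
        b = 0.5 * c * (X.T @ y)
        w = np.linalg.solve(A, b)
    return w
```

The dense solve costs O(D^3) per iteration, so for high-dimensional bag-of-words data a gradient-based solver would be the natural replacement, as the text notes.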

Using the same data augmentation technique, we can derive a variational upper bound of the expected logistic loss in Eq. (27). We further need to deal with the nonlinear feature transformation, which renders the variational bound intractable. Here, we adopt the same strategy of using the first-order Taylor expansion of $\mathbf{g}(\cdot)$ as in Eq. (17) to get an approximation around the mean corrupted features $\mu_n = E_p[\tilde{\mathbf{x}}_n]$, and then compute the variational bound, which basically involves the variance of the corruption. Then, an alternating minimization algorithm can be developed to iteratively update the following steps:


9. The offset is again ignored for simplicity.

TABLE 1

Comparison of hinge loss, logistic loss and quadratic loss under the IRLS algorithmic framework.

                Hyper-parameters                 Re-weights Update                Reduction to
Types           parameter $\ell$   parameter c   update $\gamma_n$   update $y_n$   Quadratic Loss
Hinge-loss      $\ell$             c             Eq. (15)            $y_n^h$        $\ell = 0$, $\gamma_n = 1/c$
Logistic-loss   -                  c             Eq. (35)            $y_n^l$        $\gamma_n = c/2$


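For $q(\lambda)$: the solution has the same form as in (31):

$$q(\lambda_n) = \mathcal{PG}\left(\lambda_n; c, \sqrt{E_p[\zeta_n^2]}\right),$$

but with the new definition of $\zeta_n$. Under the Taylor expansion, we have the second-order expectation as:

$$E_p[\zeta_n^2] = \mathbf{w}^\top \left( E_p[\mathbf{g}'(\tilde{\mathbf{x}}_n) \mathbf{g}'(\tilde{\mathbf{x}}_n)^\top] \right) \mathbf{w},$$

where $E_p[\mathbf{g}'(\tilde{\mathbf{x}}_n)] = \mathbf{g}(\mu_n)$ and $E_p[\mathbf{g}'(\tilde{\mathbf{x}}_n) \mathbf{g}'(\tilde{\mathbf{x}}_n)^\top] = \mathbf{g}(\mu_n) \mathbf{g}(\mu_n)^\top + \nabla_x \mathbf{g}(\mu_n)^\top V_p[\tilde{\mathbf{x}}_n] \nabla_x \mathbf{g}(\mu_n)$.

For $\mathbf{w}$: this step is similar to the linear case, but with subtle changes on the features. Specifically, by ignoring the irrelevant terms, we optimize the following objective:

$$\mathcal{L}_{[\mathbf{w}]} = \|\mathbf{w}\|_2^2 + \sum_{n=1}^{N} \left\{ \frac{1}{2} \gamma_n E_p[\zeta_n^2] - \frac{c}{2} y_n E_p[\zeta_n] \right\}, \qquad (36)$$

where $\gamma_n = E_q[\lambda_n] = \frac{c}{2\sqrt{E_p[\zeta_n^2]}} \times \frac{e^{\sqrt{E_p[\zeta_n^2]}} - 1}{1 + e^{\sqrt{E_p[\zeta_n^2]}}}$. If K is not large, we can solve this subproblem in closed form as in (33), with $\tilde{\mathbf{x}}_n$ replaced by $\mathbf{g}'(\tilde{\mathbf{x}}_n)$. If K is large, we can resort to numerical methods (e.g., quasi-Newton methods) and use the derivative for $\mathbf{w}$:

$$\frac{\partial \mathcal{L}_{[\mathbf{w}]}}{\partial w_k} = 2 w_k + \sum_n \gamma_n \left( (\hat{y}_n - y_n^l) g_k(\mu_n) + \sum_d V_p[\tilde{x}_{nd}] h_{nd} \nabla_{x_d} g_k(\mu_n) \right),$$

where $\hat{y}_n = \mathbf{w}^\top \mathbf{g}(\mu_n)$, $y_n^l = \frac{c}{2\gamma_n} y_n$ and $h_{nd} = \sum_k w_k \nabla_{x_d} g_k(\mu_n)$.

For $\alpha$: this step involves optimizing the objective:

$$\mathcal{L}_{[\alpha]} = \|\alpha\|_2^2 + \sum_{n=1}^{N} \left\{ \frac{1}{2} \gamma_n E_p[\zeta_n^2] - \frac{c}{2} y_n E_p[\zeta_n] \right\}, \qquad (37)$$

with a gradient descent method, where the gradient is

$$\frac{\partial \mathcal{L}_{[\alpha]}}{\partial \alpha_{dk}} = 2 \alpha_{dk} + \sum_n \gamma_n \left( (\hat{y}_n - y_n^l) \mu_{nd} + V_p[\tilde{x}_{nd}] h_{nd} \rho_{kd} \right) w_k \eta_k,$$

and $\rho_{kd}$ and $\eta_k$ are the same as in Section 4.2.

In summary, the nonlinear logistic regression can be learned via coordinate descent by iteratively updating $\gamma$, solving for model parameters $\mathbf{w}$ by minimizing $\mathcal{L}_{[\mathbf{w}]}$ in (36), and solving for the transformation weights $\alpha$ by minimizing $\mathcal{L}_{[\alpha]}$ in (37).

TABLE 2

A summary of the 9 datasets.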


Dataset              Train Size   Test Size   Feature Dim   Categories
Amazon-books         2,000        4,465       20,000        2
Amazon-kitchen       2,000        5,945       20,000        2
Amazon-dvd           2,000        3,586       20,000        2
Amazon-electronics   2,000        5,681       20,000        2
Dmoz                 7,184        1,796       16,498        16
Reuters              5,946        2,347       18,933        65
CIFAR-10             50,000       10,000      8,192         10
MNIST                60,000       10,000      784           10
Hotelreview          2,500        2,500       12,000        Regression


6 Experiments 

We now present empirical results on classification, regression and the challenging “nightmare at test time” scenario to demonstrate the effectiveness of the dropout training algorithm for SVMs, denoted by (linear) Dropout-SVM and its nonlinear version Dropout-LatentSVM, and the new IRLS algorithms for the dropout training of the logistic loss, denoted by (linear) Dropout-LR and its nonlinear version Dropout-LatentLR.

6.1 Datasets and Settings 

We evaluate our proposed models for classification and regression on 9 datasets, including 1) Amazon review: four types of product text review datasets including books, kitchen, dvd and electronics. The binary classification task is to distinguish whether a review is positive or negative; 2) Dmoz: a large collection of webpages organized in a tree hierarchy with 16 categories; 3) Reuters: a dataset of documents that appeared on the Reuters newswire in 1987, with 65 categories; 4) CIFAR-10: a subset of the 80 million tiny images [40]. It consists of 10 classes of 32 × 32 tiny images. We follow the experimental setup of the previous work; 5) MNIST: a dataset that consists of 60,000 training and 10,000 testing handwritten digit images from 10 categories (i.e., 0, ..., 9). The images are represented by 28 × 28 pixels, which results in a feature dimension of 784; and 6) Hotelreview [50]: a dataset that consists of 5,000 hotel reviews randomly collected from TripAdvisor. Each document is associated with a global rating score, ranging from 1 to 5. We normalize the rating scores as in [50] for regression. Table 2 summarizes the statistics of these datasets.

We consider the unbiased dropout (or blankout) noise model, that is, $p(\tilde{x} = 0) = q$ and $p(\tilde{x} = \frac{1}{1-q} x) = 1 - q$, where $q \in [0, 1)$ is a pre-specified corruption level. Then, the variance for each dimension d is $V_p[\tilde{x}_d] = \frac{q}{1-q} x_d^2$.
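The mean and variance of this noise model are easy to check numerically. The sketch below is an illustration only: the helper names `dropout_moments` and `sample_dropout` are hypothetical, and the Monte Carlo sample size is an arbitrary choice. It compares the closed-form moments of a single feature with empirical estimates.

```python
import random

def dropout_moments(x, q):
    """Closed-form mean and variance of one feature under unbiased dropout."""
    mean = x                     # (1 - q) * x / (1 - q) = x, hence "unbiased"
    var = q / (1.0 - q) * x * x  # E[x~^2] - x^2 = x^2 / (1 - q) - x^2
    return mean, var

def sample_dropout(x, q, rng):
    """Draw one corrupted value: 0 with probability q, x / (1 - q) otherwise."""
    return 0.0 if rng.random() < q else x / (1.0 - q)

rng = random.Random(0)
x, q = 2.0, 0.3
samples = [sample_dropout(x, q, rng) for _ in range(200000)]
emp_mean = sum(samples) / len(samples)
emp_var = sum((s - emp_mean) ** 2 for s in samples) / len(samples)
print(dropout_moments(x, q), (emp_mean, emp_var))
```

Because the mean is preserved, only the variance term enters the second-moment expressions used by the IRLS updates.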


10. http://www.cs.toronto.edu/~kriz/cifar.html 

11. Other noise models (e.g., Poisson) were shown to perform worse 
than the dropout model El]. We have similar observations for Dropout- 
SVM and the new IRLS algorithm for logistic regression.

Fig. 2. Classification errors on the Amazon datasets. Best viewed in color.

Fig. 1. Comparison between Dropout-SVM and the explicit corruption for SVM on the Amazon-books dataset. (Axes: classification error versus number of corrupted copies. Legend: Explicit corruption, Dropout-SVM.)

6.2 Linear Dropout Classifiers 

We first compare the marginalized corruption of Dropout- 
SVM with the explicit corruption strategy for SVM, and 
then evaluate our linear dropout classifiers on both binary 
and multi-class classification to show the effectiveness of 
dropout training on linear SVMs and logistic regression. 

6.2.1 Dropout-SVM vs. Explicit corruption 

Fig. 1 shows the classification errors on the Amazon-books dataset when an SVM classifier is trained using the explicit corruption strategy. We change the number of corrupted copies (i.e., M) from 1 to 256. Following the previous setups [41], for each value of M we choose the dropout model with q selected by cross-validation. The hyper-parameter of the SVM classifier is also chosen via cross-validation on the training data. We can observe a clear trend that the error decreases when the training set contains more corrupted versions of the original training data, i.e., as M gets larger. It also shows that the best performance is obtained when M approaches infinity, which is equivalent to our Dropout-SVM.
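The relation between explicit corruption and its M → ∞ limit can be illustrated on a loss whose expectation is available in closed form. The sketch below is not the experiment reported above: it swaps in the quadratic loss as a stand-in, since the expected hinge loss has no closed form, and the function names are hypothetical. It averages the loss over M corrupted copies of one example and compares it with the marginalized expectation.

```python
import random

def corrupt(x, q, rng):
    """One unbiased-dropout copy of the feature vector x."""
    return [0.0 if rng.random() < q else xd / (1.0 - q) for xd in x]

def explicit_quadratic_loss(w, x, y, q, M, rng):
    """Average quadratic loss over M explicitly corrupted copies of x."""
    total = 0.0
    for _ in range(M):
        xt = corrupt(x, q, rng)
        pred = sum(wd * xd for wd, xd in zip(w, xt))
        total += (pred - y) ** 2
    return total / M

def marginalized_quadratic_loss(w, x, y, q):
    """Closed-form expectation of the same loss, i.e. the M -> infinity limit."""
    pred = sum(wd * xd for wd, xd in zip(w, x))
    var = sum(q / (1.0 - q) * (wd * xd) ** 2 for wd, xd in zip(w, x))
    return (pred - y) ** 2 + var

rng = random.Random(0)
w, x, y, q = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 1.0, 0.3
for M in (1, 16, 256, 100000):
    print(M, explicit_quadratic_loss(w, x, y, q, M, rng))
print("marginalized", marginalized_quadratic_loss(w, x, y, q))
```

The explicit estimate fluctuates for small M and settles near the marginalized value as M grows, which mirrors the trend described for Fig. 1.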

6.2.2 Binary classification 

We evaluate Dropout-SVM and Dropout-LR on binary classification tasks. We use the four Amazon review datasets as detailed in Table 2. The task is to distinguish whether a review is positive or negative.

We compare our methods with those presented in [41] that minimize the quadratic loss with marginalized corrupted features (MCF), denoted by MCF-Quadratic, and that minimize the expected logistic loss, denoted by MCF-Logistic. MCF-Logistic was shown to be the state-of-the-art method for dropout training on these datasets, outperforming a wide range of competitors, including the dropout training of the exponential loss and the various loss functions with a Poisson noise model. As we have discussed, both Dropout-SVM and Dropout-LR iteratively minimize the expectation of a re-weighted quadratic loss, with the re-weights updated in closed form. We include MCF-Quadratic as a baseline to demonstrate the effectiveness of our methods on adaptively tuning the re-weights to get improved results. We implement both Dropout-SVM and Dropout-LR using C++, and solve the re-weighted least square problems using L-BFGS methods [26], which are very efficient by exploiting the sparsity of bag-of-words features when computing gradients.

Fig. 2 shows classification errors, where we cite the results of MCF-Logistic and MCF-Quadratic from [41]. We can see that on all datasets, Dropout-SVM and Dropout-LR generally outperform MCF-Quadratic except when the dropout level is larger than 0.9, suggesting that adaptively updating the re-weights can improve the performance. Meanwhile, the two proposed models give results comparable with (and a bit better on the kitchen dataset than) the state-of-the-art MCF-Logistic, which means that dropout training on SVMs is an effective strategy for binary classification. Finally, noting that Dropout-SVM reduces to the standard SVM when the corruption level q is zero, we can see that dropout training can significantly boost the classification performance of simple linear SVMs.

6.2.3 Multi-class classification 

We also evaluate on multiclass document/image classification tasks, using the DMOZ, Reuters and CIFAR-10 datasets. There are various approaches to applying binary Dropout-SVM and Dropout-LR to multiclass classification, including “one-vs-all” and “one-vs-one” strategies. Here we choose “one-vs-all”, which has shown effectiveness in many applications. The hyper-parameters are selected via cross-validation on the training set.
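The “one-vs-all” reduction can be sketched generically. In the snippet below the binary learner is a regularized least-squares stand-in (an assumption for illustration, not Dropout-SVM or Dropout-LR), and all function names are hypothetical; any binary trainer that returns a weight vector could be passed in instead.

```python
import numpy as np

def ridge_binary(X, y, reg=1.0):
    """Stand-in binary learner: regularized least squares on +1/-1 labels."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(D), X.T @ y)

def train_one_vs_all(X, labels, train_binary=ridge_binary):
    """Train one weight vector per class, with that class as +1 and the rest as -1."""
    classes = np.unique(labels)
    W = np.stack([train_binary(X, np.where(labels == k, 1.0, -1.0)) for k in classes])
    return classes, W

def predict_one_vs_all(classes, W, X):
    """Predict the class whose binary classifier gives the largest score."""
    return classes[np.argmax(X @ W.T, axis=1)]
```

With C classes this trains C independent binary problems, so the binary solver can be reused without modification.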

Document classification: Fig. 3 shows the classification errors on the DMOZ and Reuters datasets. It can be observed that all methods can successfully boost the performance with dropout training; besides, Dropout-SVM performs comparably with Dropout-LR (or better on the DMOZ dataset) for different dropout levels, and significantly outperforms MCF-Quadratic, which demonstrates the effect of updating the re-weights in our IRLS algorithm. Moreover, Dropout-SVM performs slightly better than the state-of-the-art method (i.e., MCF-Logistic) under the dropout training setting. This is consistent with the binary classification observations.

12. We don’t compare time with MCF methods, whose implementation
(http://homepage.tudelft.nl/19j49/mcf/Marginalized_Corrupted_Features.html)
is in Matlab and slower than ours.

Fig. 3. Classification errors on Dmoz and Reuters datasets. Panels: (a) DMOZ dataset, (b) Reuters dataset.

TABLE 3

Classification errors on the CIFAR-10 dataset.

Model           No Corrupt   Dropout q = 0.2   Dropout q = 0.3
Dropout-SVM     0.322        0.291             0.293
Dropout-LR      0.312        0.291             0.290
MCF-Logistic    0.325        0.296             0.294
MCF-Quadratic   0.326        0.322             0.321

Image classification: Table 3 presents the results on the CIFAR-10 image dataset, where the results of quadratic loss and logistic loss under the MCF learning setting are cited from [41]. We can see that all the methods (except for the quadratic loss) can significantly boost the performance by adopting dropout training. Meanwhile, both Dropout-SVM and Dropout-LR are competitive, achieving performance comparable to the state-of-the-art method (i.e., MCF-Logistic) under the dropout training setting. Both Dropout-SVM and Dropout-LR outperform MCF-Quadratic, which demonstrates the effect of updating the re-weights in our IRLS algorithm.


6.2.4 Regression 

We evaluate the dropout support vector regression (SVR) model on predicting rating scores for the hotel review dataset. We compare with the MCF-Quadratic model [41], which refers to the standard least squares with marginalized dropout training. We use predictive $R^2$ as the measurement, which is defined as $R^2 = 1 - \frac{\sum_d (y_d - \hat{y}_d)^2}{\sum_d (y_d - \bar{y})^2}$, where $y_d$ is the ground-truth response, $\hat{y}_d$ is the predicted value, and $\bar{y}$ is the mean of all the responses.
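As a small worked example of this measure, the function below computes predictive $R^2$ directly from its definition. The function name and the toy numbers are illustrative choices, not values from the experiments.

```python
def predictive_r2(y_true, y_pred):
    """Predictive R^2: one minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
print(predictive_r2(ratings, ratings))    # perfect predictions
print(predictive_r2(ratings, [3.0] * 5))  # always predicting the mean
```

A model that always predicts the mean response scores 0, a perfect model scores 1, and a model worse than the mean predictor scores below 0.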

Fig. 4 shows the predictive $R^2$ score. We can see: 1) Dropout-SVR outperforms MCF-Quadratic for all the dropout levels, which implies the discriminative power of our Dropout-SVR using the $\epsilon$-insensitive loss; 2) noting that Dropout-SVR reduces to the standard SVR when the corruption level q is zero, Dropout-SVR can successfully improve the performance except when the corruption level q is larger than 0.8. This means that our dropout training strategy is also effective for regression tasks. The best regression performance is obtained when $0.1 < q < 0.5$.

13. The exponential loss was shown to be worse; thus omitted.

Fig. 4. Predictive $R^2$ on the Hotelreview dataset.

TABLE 4

Classification errors on the CIFAR-10 dataset.

Model                         No Corrupt   Dropout q = 0.2   Dropout q = 0.3
Dropout-LatentSVM (K = 10)    0.292        0.2^1             0.285
Dropout-LatentSVM (K = 20)    0.291        0.275             0.271
Dropout-LatentSVM (K = 30)    0.289        0.269             0.266
Dropout-LatentSVM (K = 40)    0.288        0.268             0.265
Dropout-LatentLR (K = 10)     0.295        0.280             0211
Dropout-LatentLR (K = 20)     0.290        0.271             0.268
Dropout-LatentLR (K = 30)     0.285        0.264             0.264
Dropout-LatentLR (K = 40)     0.284        0.262             0.261


6.3 Classifiers with Representation Learning 

We evaluate the Dropout-LatentSVM and Dropout- 
LatentLR on both image and text classification tasks. 

For image classification, Table 4 shows the errors of different nonlinear models on the CIFAR-10 dataset. We can see that the prediction performance is significantly improved by the nonlinear Dropout-SVM and Dropout-LR, especially when using dropout training. The best performance is obtained when the latent dimension is 40. These results demonstrate that our dropout training strategy is very effective for nonlinear classifiers.

For text classification, Fig. 5 shows the errors of Dropout-LatentSVM and Dropout-LatentLR on the Amazon review datasets. We have the following observations: 1) both methods perform comparably on the four datasets, which is not surprising due to the very similar IRLS algorithms; 2) dropout training can consistently boost the classification performance for both Dropout-LatentSVM and Dropout-LatentLR, compared with the standard nonlinear classifiers obtained when the dropout level q equals zero; 3) the nonlinear dropout classifiers do not obtain significant improvements compared with the linear classifiers on the document classification task, probably because the words are already high-level representations or the simple fully connected network is not suitable for text documents.

Fig. 5. Classification errors of Dropout-LatentSVM and Dropout-LatentLR on the Amazon datasets. Panels: (a) books, (b) kitchen, (c) dvd, (d) electronics.


6.4 Nightmare at test time 


Finally, we evaluate our methods under the “nightmare at test time” supervised learning scenario, where some input features that were present when building the classifiers may “die” or be deleted at testing time. In such a scenario, it is crucial to design algorithms that do not assign too much weight to any single feature, no matter how informative it may seem at training. Previous work has conducted the worst-case analysis as well as the learning with marginalized corrupted features. We take this scenario to test the robustness of our dropout training algorithms for linear Dropout-SVM, Dropout-LR as well as the nonlinear Dropout-LatentSVM and Dropout-LatentLR.

We follow the setup of previous work and choose the MNIST dataset. We train the models on the full training set, and evaluate the performance on different versions of the test set in which a certain level of the features are randomly dropped out, i.e., set to zero. We compare the performance of our dropout learning algorithms with the state-of-the-art MCF predictors that use the logistic loss and quadratic loss. To the best of our knowledge, these two models also show state-of-the-art performance on the same task. We also compare with FDROP, which is a state-of-the-art algorithm for the “nightmare at test time” setting that minimizes the hinge loss under an adversarial worst-case analysis. During training, we choose the best models over different dropout levels via cross-validation. For both Dropout-SVM and Dropout-LR, we adopt the “one-vs-all” strategy as above for the multiclass classification task.
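The test-time protocol can be sketched as follows. This is an illustration under assumptions: each feature is deleted independently with the given probability (the original setup may instead delete a fixed fraction of features), `predict` is any function mapping a feature matrix to labels, and the function names are hypothetical.

```python
import numpy as np

def delete_features(X, rate, rng):
    """Zero out each feature of each test example independently with probability `rate`."""
    keep = rng.random(X.shape) >= rate
    return X * keep

def error_under_deletion(predict, X_test, y_test, rates, seed=0):
    """Classification error of `predict` at each feature-deletion rate."""
    rng = np.random.default_rng(seed)
    return {
        r: float(np.mean(predict(delete_features(X_test, r, rng)) != y_test))
        for r in rates
    }
```

Plotting the returned errors against the deletion rates gives a robustness curve of the kind shown in Fig. 6.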

Fig. 6(a) shows the classification errors of linear dropout classifiers compared with other state-of-the-art methods as a function of the random deletion percentage of features at testing time. Following previous settings, for each deletion percentage, we use a small validation set with the same deletion level to determine the regularization parameters and the dropout level q on the whole training data. From the results, we can see that the proposed Dropout-SVM is consistently more robust than all the other competitors, including the two methods that minimize the expected logistic loss, especially when the feature deletion percentage is high (e.g., > 50%). Compared with the standard SVM (i.e., the method Hinge-L2) and the worst-case analysis of hinge loss (i.e., Hinge-FDROP), Dropout-SVM consistently boosts the performance when the deletion ratio is greater than 10%. As expected, Dropout-SVM also significantly outperforms the MCF method with a quadratic loss (i.e., MCF-Quadratic), which is a special case of Dropout-SVM as shown in our theory. Finally, we also note that our iterative algorithm for the logistic loss works slightly better than the previous algorithm (i.e., MCF-Logistic) when the deletion ratio is larger than 50%.

Fig. 6(b) shows the errors of nonlinear dropout classifiers compared with linear classifiers as a function of the random deletion percentage of features at testing time. It can be observed that the nonlinear classifiers with latent representation significantly boost the prediction performance, which is consistent with the previous studies in the literature. Furthermore, noting that Dropout-LatentSVM reduces to the standard LatentSVM with a one-layer perceptron when the dropout level equals zero, the dropout training strategy consistently boosts the performance when the deletion ratio is greater than 20%. Both Dropout-LatentSVM and Dropout-LatentLR are competitive for all types of feature deletion.

Fig. 6. Classification errors of nightmare at test time on MNIST dataset. Panels: (a) Linear Models, (b) Non-linear Models.


6.5 Latent Representation Visualization 

We examine various characteristics of the latent features learned by Dropout-LatentSVM to show its ability to learn predictive latent subspace representations.

6.5.1 MNIST

We take a careful examination of each dimension in the discovered latent subspace of Dropout-LatentSVM. Note that our Dropout-LatentSVM is a binary classifier, and we use the “one vs. others” strategy for multi-class classification on the MNIST dataset. Here we choose the result of the “0 vs. others” classifier to visualize the discriminative latent representations. Fig. 7 shows six example hidden units (each unit corresponds to one dimension in the latent subspace) discovered by the Dropout-LatentSVM. For each hidden unit, we show six top-ranked images that yield the highest expected value of $g_k(\mathbf{x})$, together with the SVM parameter $w_k$. On the right side of Fig. 7, we show the average probability of each category distributed on the particular hidden unit. We can see that hidden units with different $w_k$ values are very discriminative and predictive for several categories. For example, the first two hidden units (k = 2 and k = 11) with very positive $w_k$ values are discriminative in predicting the category “0”; the last two hidden units (k = 13 and k = 5) with very negative $w_k$ values are good at discriminating a subset of categories {2, 3, 5, 7, 8} against “0”; while the middle two hidden units (k = 55 and k = 59) with $w_k$ values close to zero are roughly neutral and tend to represent all the categories.

14. http://yann.lecun.com/exdb/mnist/index.html

Fig. 7. (L) example images with highest responses to hidden units discovered by one of the binary classifiers (0 vs. others) of a 60 hidden-unit Dropout-LatentSVM model on the MNIST dataset. (R) average probabilities of each topic on representing images from the 10 categories. Hidden units shown: positive: k = 2 ($w_k$ = 17.734), k = 11 ($w_k$ = 7.2059); neutral: k = 55 ($w_k$ = 0.4145), k = 59 ($w_k$ = -0.3355); negative: k = 13 ($w_k$ = -7.3152), k = 5 ($w_k$ = -12.938).

Fig. 8. Visualization on the CIFAR-10 dataset discovered by one of the binary classifiers (airplane vs. others) of a 40 hidden-unit Dropout-LatentSVM. (L) SVM parameter values sorted in descending order; (R) example images with highest responses to different hidden units that are associated with different SVM parameter values.

6.5.2 CIFAR-10 

Similarly, we examine the latent representations learned by Dropout-LatentSVM on the CIFAR-10 dataset. In Fig. 8, we take the “airplane vs. others” binary classifier with 40 hidden units as an example to visualize the latent representations. On the left is the SVM parameter $w_k$ value sorted in descending order. The right side of Fig. 8 shows the 3 groups of top-ranked example images of each latent dimension, where each group is associated with a different $w_k$ value (e.g., positive, neutral, negative). We can see that the latent subspace representations discovered by Dropout-LatentSVM are very expressive and discriminative. For example, the subgroup {2, 17, 19, 27} with positive $w_k$ values is more representative of category “airplane”; the subgroup {13, 1, 10, 11} with negative $w_k$ values tends to represent categories {“automobile”, “truck”, “ship”, “cat”}; and the subgroup {16, 3, 40, 4} is neutral and tends to represent the categories {“automobile”, “truck”, “bird”, “cat”, “horse”, etc.}.

Fig. 9. Sensitivity of Dropout-LatentSVM to Latent Dimension on Amazon review datasets. Panels: (a) books, (b) kitchen, (c) dvd, (d) electronics.

Fig. 10. Time comparison of different models on the Amazon books dataset. Legend: Dropout-LatentSVM, Dropout-LatentLR, Dropout-SVM, Dropout-LR, MCF-Logistic, MCF-Quadratic.

6.6 Sensitivity to Latent Dimension 

To provide more insights about the behavior of dropout classifiers, we investigate the prediction performance of nonlinear classifiers with respect to the latent dimensions. Fig. 9 shows the classification errors of Dropout-LatentSVM on the four Amazon review datasets with different latent dimensions. We can see that Dropout-LatentSVM, at the different dropout levels tested, is insensitive to the latent dimension on all datasets. We have similar observations for Dropout-LatentLR.

6.7 Time Complexity 

Fig. 10 compares the time efficiency of both linear and nonlinear Dropout-SVM and Dropout-LR models with MCF-Logistic and MCF-Quadratic models on the Amazon-books review dataset. The four proposed models (i.e., Dropout-SVM, Dropout-LR, Dropout-LatentSVM and Dropout-LatentLR) are implemented in C++, and we use the Matlab implementation of MCF-Logistic and MCF-Quadratic. All the models are run on a 3.40GHz desktop with 4GB RAM. For training, we can observe that: 1) the time cost of the linear classifiers (i.e., Dropout-SVM and Dropout-LR) is comparable with (slightly lower than) that of the MCF-Logistic and MCF-Quadratic models, which shows the efficiency of our proposed methods; 2) Dropout-SVM and Dropout-LR models are more efficient than the Dropout-LatentSVM and Dropout-LatentLR models, which is reasonable as nonlinear classifiers need to learn a one-hidden-layer perceptron. For testing, all the models are deterministic and very efficient for making predictions.


7 Conclusions and Future Work

We present dropout training for both linear SVMs and their nonlinear extension by learning latent features, with an iteratively re-weighted least square (IRLS) algorithm by using data augmentation techniques. Similar ideas are applied to develop a new IRLS algorithm for the dropout training of logistic regression. Our IRLS algorithms provide insights on the connection and difference among various losses in dropout learning settings. Empirical results on various tasks demonstrate the effectiveness of our approaches.

For future work, it remains open whether the kernel trick can be incorporated in dropout learning. We are also interested in developing more efficient algorithms, e.g., online dropout learning, to deal with even larger datasets, and investigating whether Dropout-SVM can be incorporated into a deep learning architecture or learning with latent structures [49] and in the context of hierarchical Bayes networks [24]. We are also interested in designing better and more informed dropout policies, e.g., using
reinforcement learning techniques.

Appendix A: Variational Upper Bound

We provide details on deriving the variational bound of the expected hinge loss. To simplify notations, we derive the bound for a single data point. For a dataset with N examples, a simple summation will give the final bound. Define $g(\theta; \mathbf{x}) = E_p[\log \phi(y | \tilde{\mathbf{x}}, \theta)]$. We have

where $\lambda$ is the augmented variable and $c'$ is a constant. Note that if there is no uncertainty in the feature corruption (e.g., the corruption level in the dropout or blankout noise is 0), the bound is tight. That is, the optimal solution of q will give the original hinge loss.


Appendix B: Proof of Lemma 1

Proof: Ignoring the $\ell_2$-norm regularizer, we have the objective of the M-step:

where $\gamma_n = E_q[\lambda_n^{-1}]$. Using the definition of $\zeta_n$ and ignoring the constants, we have the simplified objective function (again without the $\ell_2$-regularizer):

where $y_n^h = (\ell + \frac{1}{c\gamma_n}) y_n$ is the re-weighted label.

We now derive the equations to compute $\gamma_n$. By the transformation rule of probability distributions that $p(x) = p(f(x)) \left| \frac{\partial f}{\partial x} \right|$, we obtain the distribution of $\lambda_n^{-1}$ from that of $\lambda_n$, where the last equality is due to the fact that $q(\lambda_n)$ is of the inverse Gaussian form shown earlier. □